DMDD: A Large-Scale Dataset for Dataset Mentions Detection


Abstract

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.


Similar resources

Jacquard: A Large Scale Dataset for Robotic Grasp Detection

Grasping skill is a major ability that a wide number of real-life applications require for robotisation. State-of-the-art robotic grasping methods perform prediction of object grasp locations based on deep neural networks which require huge amount of labeled data for training and prove impracticable in robotics. In this paper, we propose to generate a large scale synthetic dataset with ground tr...


VoxCeleb: A Large-Scale Speaker Identification Dataset

Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected ‘in the wild’. We make two contributions. First, we propose a fully automated pipeline based on computer vision technique...


A Diverse Large-scale Dataset for Evaluating Rebroadcast Attacks

We describe the acquisition of a large, diverse set of rebroadcast images captured by a screen-grab, scanning a printed photo, or rephotographing a displayed or a printed photo. This dataset consists of 14,500 rebroadcast images captured from a diverse set of devices: 234 displays, 173 scanners, 282 printers, and 180 recapture cameras. The diversity of this dataset—across devices and types of ...


A Large-scale Dataset and Benchmark for Similar Trademark Retrieval

Trademark retrieval (TR) has become an important yet challenging problem due to an ever increasing trend in trademark applications and infringement incidents. There have been many promising attempts for the TR problem, which, however, fell impracticable since they were evaluated with limited and mostly trivial datasets. In this paper, we provide a large-scale dataset with benchmark queries with...


Large-scale Multiview 3D Hand Pose Dataset

Accurate hand pose estimation at joint level has several uses on human-robot interaction, user interfacing and virtual reality applications. Yet, it currently is not a solved problem. The novel deep learning techniques could make a great improvement on this matter but they need a huge amount of annotated data. The hand pose datasets released so far present some issues that make them impossible ...



Journal

Journal title: Transactions of the Association for Computational Linguistics

Year: 2023

ISSN: 2307-387X

DOI: https://doi.org/10.1162/tacl_a_00592